Document Representations

Authors

  • Dorothea Blostein
  • Richard Zanibbi
  • George Nagy
  • Rob Harrap
Abstract

Many document representations are in use, and each explicitly encodes different aspects of a document. External document representations, which use standard file formats (such as JPEG, PostScript, HTML, and LaTeX), communicate document data between programs. Internal document representations are used within document analysis or document production software to store intermediate results in the transformation from the input to the output document representation. These document representations are central to defining and solving document analysis problems. Issues that can be investigated include defining equivalence of documents and distance between documents, mathematically characterizing the mapping between document representations, characterizing the external information needed to carry out these mappings, and characterizing the differences between the forward and inverse mappings that occur during document analysis and document production. From our ongoing investigation of these issues, we present a summary of internal document representations used in the table-recognition literature, and case studies of external document representations in the domains of circuit diagrams and text documents.

Similar Resources

Learning Document Image Features With SqueezeNet Convolutional Neural Network

The classification of various document images is considered an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered to be the current state-of-the-art model for this task. However, these classifiers have two major drawbacks: the huge computational power demand for...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A great deal of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

Semantic Regularities in Document Representations

Recent work has shown that distributed word representations are good at capturing linguistic regularities in language. This allows vector-oriented reasoning based on simple linear algebra between words. Since many different methods have been proposed for learning document representations, it is natural to ask whether there is also linear structure in these learned representations to allow simil...

Category Theory as a Foundation for Document Processing

Documents, particularly electronic documents that are created, disseminated, and used with computers, have several representations. Users may wish to work with such electronic documents in any of a document's representations, and this can make it difficult to maintain consistency between the different representations of a document. Category theory provides insight into this problem. We begin by d...

Publication date: 2003